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Abstract 

Biological systems display impressive capabilities in effectively responding to envi- 
ronmental signals in real time. There is increasing evidence that organisms may indeed 
be employing near optimal Bayesian calculations in their decision-making. An intriguing 
question relates to the properties of optimal encoding methods, namely determining the 
properties of neural populations in sensory layers that optimize performance, subject 
to physiological constraints. Within an ecological theory of neural encoding/decoding, 
we show that optimal Bayesian performance requires neural adaptation which reflects 
environmental changes. Specifically, we predict that neuronal tuning functions pos- 
sess an optimal width, which increases with prior uncertainty and environmental noise, 
and decreases with the decoding time window. Furthermore, even for static stimuli, 
we demonstrate that dynamic sensory tuning functions, acting at relatively short time 
scales, lead to improved performance. Interestingly, the narrowing of tuning functions 
as a function of time was recently observed in several biological systems. Such results 
set the stage for a functional theory which may explain the high reliability of sensory 
systems, and the utility of neuronal adaptation occurring at multiple time scales. 

Keywords: Tuning functions, neural encoding, population coding, Bayesian decoding, op- 
timal width. Fisher information. 
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1 Introduction 



Behaving; organisms ofte n exhibit optim a l, or n early-optimal performance (e.g. jjacobsl (Il999[ ): 
Ernst and Banksl (120021 ): [Jacobs et al.l ( 120091 ) ). despite physiological constraints and noise 
which is inherent to both the environment and the neural activity. Experimental work 
performed over the past few years has suggested that one means by which such effective 
behavior is achieved is throu gh adaptation to t he environ ment occurring a cross multiple 



time scales (e.g. m il liseco nds - 1 Wang et al.l ( 120051 ) , seconds - iDean et al.l . l2008l and minutes 



Pettet and Gilbert A major challenge pertaining to such results relates to the con- 

struction of a coherent theoretical framework within which the efficacy of such adaptation pro- 
cesses can be motiv ated and assessed. W e believe that a carefully articulated reverse engineer- 
ing perspective (see lMarom et al.l (120091 )). based on the utilization of well founded engineering 
principles, can yield significant insight about these phenomena. More specifically, such ap- 
proaches can explain how the environmental state may be effectively estimated from the neu- 
ral spike trains and from prior knowledge through a process of neural decoding. Within this 
context, many psychophysical and neurophysiological results can be explained by Bayesian 
models, suggesti ng that organisms rr i ay indeed be employing Bayesi a n cal c ulations in their 



decision-making (iKnill and Richards! (Il996l ): iKnill and Pougetl (120041 ): iRaol (l2004f ): lMa et al. 
(I2OO6I )] 

static ( 



Optimal r e al-time decoding me t hods have been proposed over t h e past decade fo r 
Zemel et all (Il998l): IPeneve et al.| (I2OOII): IPouget et all (l2002l.l2003l):lMa et"al1 (|2006 |) : 



Averbeck et al.l (120061): iBeck et al.l (12(3081)') a nd dynamic (iTwum-Danso and BrockettI (120011): 
Eden et al.l (l2004l): iBeck and Pougetl ( I2007I ): IHuvs et al.l (l2007f l: IPitkow et al.l (l2007t ): IPeneve 
( 2008[ ): Bobrowski et al. ( 2009[ )) environments. While much of this work has been performed 
in the context of sensory sys tems, it has also been widely used in the context of higher brain 
regions (e.g., hippocampus - Barbieri et al. (2004); Eden et al. ( 20041 )). and may therefore be 
relevant across multiple physiological levels. 

Given the well developed theory of Bayesian decoding, an intriguing follow-up question 
relates to the properties of optimal encoding methods, namely determining the properties of 
a neural population in the sensory layer that optimizes performance, subject to physiological 
constraints. Sensory neurons are often characterized by t heir tuning functions ( sometimes re- 
ferred t o as "tuning curves" - e . g. I Anderson et al. (l2000l):lB renne r et all (I2OOOI ): IPra goi et al.l 



(l2000[ ): lHarper and McAlpind ( l2004h : lKorte and Rauschecker. (19931)), which quantify the re- 
lationship between an external stimulus and the evoked activities of each neuron, typically 
measured by the probability, or frequency, of emitting spikes. From an ecological point 
of view it is speculated that optimal tuning functions are not universal, but rather adapt 
themselves to the specific context and to the statistical nature of the environment. The neu- 
rophysiological literature offers much experimental evidence for neurons in many areas, which 
respond to changes in the statistical a ttributes of stimuli by modulating their re s ponse prop 



erties ( e.g. iPettet and GilbertI (119921 ) : lBrenner et all (120001 ) : lDragoi et al.l (l2000[ l: lDean et al. 
(l2005l ): lHosova et al.l (120051 )1 



Theoretical studies of optimal tuning functions hinge on the notion of optimality, which re- 
quires the definition of an appropriate cost function. Arguably, a natural cost function is some 
measure of distance between the true and the estimated environmental state. In this context 
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a reasonable choice is the mean squared error (MSE), or the average Euchdean distance be- 
tween the two signals. While other distance measures can be envisaged, and should indeed 
be studied in appropriate contexts, there are many advantages to using this specific measure, 
not least of which is its relative analytic tractability. The optimality criterion then becomes 
minimal MSE (MMSE), where it is assumed that the spike trains generated by the sensory 
cells are later processed by an optimal decoder. Unfortunately, even in this relatively simple 
setting, analytical examination of the dependence of the MMSE on the tuning functions is in- 
feasible in the general case, and many researchers draw conclusions about optimality of tuning 



mation 
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(1993 
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'oueet et al. 


(1999 
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Zhane and Seinowski (1999): 


Tovoizumi et al. ( 


2006): Johnson and Rav 
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(2004): 


1/anskv and Greenwood 


^005) 


Brown and Backer ( 


2006f)'). Unfortunatelv. in manv 



cases reaching conclusions based on, often loose, bounds can lead to very misleading results, 
which stand in distinct opposition to predictions arising from analyzing the MMSE itself (see 
Section [3l). Along differen t lines, other r e searchers consider information theor e tic quantities 



Brunei and Nadal (1998): 



Panzeri et all (Il999f l: McDonnell and Stocks! (120081 ) : lNikitin et al. 



(120091 ): iGeisler et al.l (120091 )). attempting to determine conditions under which the maximal 
amount of information is conveyed, subject to various constraints. However, a direct relation 
(as opposed to bounds) between such q uantities and the MSE has only been established in 
very specific situations (IDuncanl (Il970[): iGuo et al. (|2005()') and does not ho ld in general. A 



notable exception is the work of iBethge et al.l (120021 ) and lBethge et al.l (120031 ) . who employed 



Monte Carlo simulation s in order to assess the MMSE directly in specific scenarios; see also 



Bobrowski et al.l (120091 ). who computed the MMSE analytically to find optimal tuning func- 



tions in a concrete setting. 



In this paper we directly address the issue of MMSE-optimal tuning functions, using 
well-justified approximations which lead to explicit analytic results. We examine various sce- 
narios, including some that were not previously treated, and make novel predictions that can 
be tested experimentally. We begin in Section 12711 by analyzing optimality in terms of Fisher 
information and demonstrate why drawing qualitative and quantitative conclusions based on 
bounds can be misleading. In fact, bound-based predictions can sometimes be diametrically 
opposed to the predictions based on the true error. This notion is of great importance due 
to the very prevalent use of approximations and bounds on the true error. We then move 
in Section I2l2l to discuss the implications of directly minimizing the MMSE, focusing on the 
effects of noise and multimodality. The advantages of dynamic real-time modification of tun- 
ing function properties are analyzed in Section I2l3| and concrete experimental predictions are 
made. In fact, some of these predictions have already been observed in existing experimental 
data. Specifically, we predict that neuronal tuning functions possess an optimal width, which 
increases with prior uncertainty and environmental noise, and decreases with the decoding 
time window. The results of the paper are summarized and discussed in Section |3l and the 
mathematical details appear in Section HI 
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2 Results 



We investigate the problem of neural encoding and decoding in a static environment. More 
formally, consider an environment described by a random variable X, taking values in some 
space X, and characterized by a probability density function p{-) (in order to simplify no- 
tation, we refrain from indexing probability distributions with the corresponding random 
variable, e.g., Pxi'), as the random variable will be clear from the c o ntext ). In general, 
X may represent a stochastic dynamic process (e.g., iBobrowski et al. but we limit 



ourselves in this study to the static case. In typical cases X may be a random vector, e.g., 
the spatial location of an object, the frequency and intensity of a sound and so on. More 
generally, in the case of neurons in associative cortical regions, X can represent more abstract 
quantities. Suppose that X is sensed by a population of M sensory neurons which emit spike 
trains corresponding to the counting processes Nj = {A'^^, . . . ,N^ ^}, where iV™ represents 
the number of spikes emitted by cell m up to time t. Denote by Xm{-) the tuning function 
of the m-th sensory cell. We further assume that, given the input, these spike trains are 
conditionally independent Poisson processes, namely 

iVf^lX ~ Pois(A„,(X)t) (m = l,...,M), 

implying that P{NJ^ = n\X) = e'^""^^^^ {Xm{X)t)"' / n\ (in a more general case of dynamic 
tuning functions {Xm{t,-)}, with which we deal late r, the parameter of the Poisson distri- 
bution is Jq Xm{s, X)ds - see IBobrowski et al. Following this encoding process, the 



goal of an optimal decoder is to find the best reconstruction of the input X, based on obser- 
vations of the M spike trains N^. In this paper we focus on the converse problem, namely 
the selection of tuning functions for encoding, that facilitate optimal decoding of X. 

We formulate our results within a Bayesian setting, where it is assumed that the input 
signal X is drawn from some prior probability density function (pdf) p{-), and the task is to 
estimate the true state X, based on the prior distribution and the spike trains observed up 
to time t. For any estimator X, we consider the mean squared error MSE(X) = E[(X — X)^], 
where the expectation is taken over both the observations and the state X. It is well known 
that X°P*, the es timator minimizing the MSE, is given by the conditional mean E[X|Nt] (e.g.. 



van Treed (Il968l )). which depends on the parameters defining the tuning functions {Xm{x)}. 



As a specific example, assuming that the tuning functions are Gaussians with centers 
Ci, . . . ,Cm and widths ai, . . . , Om, the estimator X, and the corresponding minimal MSE, 
referred to as the MMSE, depend on = [ci, . . . ,CM,cti, ■ ■ ■ ,ctM)- The optimal values of the 
parameters are then given by a further minimization process, 

6>°P* = argmin{MMSE(6>)}. 
e 

In other words, 6°^^ represents the tuning function parameters which lead to minimal recon- 
struction error. In this paper we focus on fixed centers of the tuning functions, and study 
the optimal widths of the sensory tuning functions that minimize MMSE(0). 
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Since it is often difficult to compute MMSE(0), an alternative approach which has been 
widely used in the neural computation literature is based on minimizing a more tractable 
lower bound on the MSE, an issue which we now turn to. 



2.1 Fisher-optimal width 



In a Bayesian context, it is well known (Ivan Treesl . ll968l ) that for any estimator X, the MSE is 



lower bounded by the Bayesian Cramer- Rao lower bound (BCRB) given in ([7]). An interest- 
ing question is the following: considering that the MSE of any estimator (and thus the MMSE 
itself) is lower bounded by the BCRB, do the optimal widths also minimize the BCRB? In 
other words, can the BCRB be used as a proxy to the MMSE in selecting optimal widths? If 
this were the case, we could analyze MMSE-optimality by searching for widths that maximize 
the expected value of the Fisher information defined in ([6]). This alternative analysis would 
be favorable, because in most cases analytical computation of E[i7'(X)] is much simpler than 
that of the MMSE, especially when the conditional likelihood is separable, in which case the 
population's log likelihood reduces to a sum of individual log likelihood functions. Moreover, 
it remains analytically tractable under much broader conditions. Unfortunately, as we show 
below, the answer to the above question is negative. Despite the fact that for any encoding 
ensemble the BCRB lower bounds the MMSE, and even approximates it in the asymptotic 
limit, the behavior of the two (as a function of tuning function widths) is very different, as 
we demonstrate below. 

In terms of the BCRB the existence of an optimal set of widths critically depends on the 
dimensionality of the environmental state. We consider first the univariate case, and then 
discuss the extension to multiple dimensions. 



2.1.1 The univariate case 



In the scalar case, using the tuning functions in ([2]), it follows from that it is best to 
employ infinitesimally narrow tuning functions, since E[J7'(X)] — oo most rapidly when 
ttm — ^ for all m, although Fisher information is undefined when one of the width param- 
eters is e xactly 0. A similar resu l t was s uggested previously for non- B ayesian estimation 
schemes ( ISeung and SompolinskyI (119931 ); IZhang and Sejnowskil ( 119991 ); iBrown and Backer 
( 2006[ )). but within the framework that was employed there the result was not valid, as we 
elucidate in Section [31 Why does Fisher information predict that "narrower is better"? As 
a tuning function narrows, the probability that the stimulus will appear within its effective 
receptive field and evoke any activity decreases, but more and more information could be 
extracted about the stimulus if the cell did emit a spike. Evidently, in the limit — > the 
gain in information dominates the low probability and BCRB — t- 0; however, this is precisely 
the regime where the bound vanishes and is therefore trivial. This result is in complete 
contrast with the MSE, which is minimal for non-zero values of a as we show below. More 
disturbingly, as argued in Section 12.2.11 the performance is in fact the worst possible when 
a — 7- 0. We note that the inadequ acy of conclusions dr awn from lower bounds was addressed 
using Monte Carlo simulations by lBethge et al.l (j2002[ ). Unfortunately, as we explain in Sec- 
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tion 14. 2^ the asymptotic approximation used in iBethge et al.l (|2002[ ) for the MMSE may not 
hold in the Bayesian setting. 



2.1.2 The multivariate case 

In this case, using on the (i— dimensional Gaussian tuning functions ([5]) characterized by the 
matrices , the problem becomes more complex as is clear from ( llOp . because the tuning 
functions may have different widths along different axes or even non-diagonal matrices. For 
simplicity we focus on the case where the matrices Am are diagonal. By definition, the MSE 
of any estimator is the trace of its error correlation matrix, and in accordance with ([9]), 
Fisher-optimal tuning functions minimize the trace of J~^, which equals the sum of its eigen- 
values. If the widths in all dimensions are vanishingly small [Am = el where e ^ 1), Am 
becomes negligible with respect to S^, and E[j7'(X)] ^ -^max^ Xlm I^^I^^^^V^P^^m^- 
In 3D and in higher dimensions the matrix \ Am \ A^ = diag close 
to the zero matrix, and consequently the eigenvalues of E[i7(X)] are vanishingly small. 
If all widths are large {Am = (l/^)! where e <^ 1), Am. dominates S^. and E[j7'(X)] ^ 
^maxtJ2^~^^"^^rn[^x + (Cm — A*x)(Cm — Mx)^]) namely the eigenvalues of E[J7'(X)] are still 
vanishingly small. This means that when the widths are too small or too large, the BCRB 
dictates poor estimation performance in high dimensional settings (specifically, the informa- 
tion gain no longer dominates the low probability in the limit of vanishing widths). Therefore, 
there exists an optimal set of finite positive widths that minimize the BCRB. As an excep- 
tion, in 2D the eigenvalues of y^jA^A"^ are finite when the widths approach 0, and E[j7'(X)] 
cannot be said to have infinitesimally small eigenvalues. In fact, numerical calculations for 
radial prior distribution reveal that in 2D "narrower is still better", as in the univariate case. 

An interesting observation concerning the BCRB based optimal widths is that they cannot 
depend on the available decoding time t, since E[j7'(X)] is simply proportional to t. As we 
will show, an important consequence of the present work is to show that the optimal tuning 
function widths, based on minimizing the MMSE, depend explicitly on time. In this sense, 
choosing optimal tuning functions based on the BCRB leads to a qualitatively incorrect 
prediction. We return to this issue in Section [3l where further difficulties arising from the 
utilization of lower bounds are presented. 

2.2 MMSE-based optimal width 

Having discussed predictions about optimal widths based on BCRB minimization, we wish 
to test their reliability by finding optimal widths through direct MMSE minimization. As 
explained in Section HI in order to facilitate an analytic derivation of the MMSE, we examine 
here the case of dense equally spaced tuning functions with uniform widths. We consider 
first the univariate case and then proceed to the multivariate setting. 
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Figure 1: (A) The normalized MMSE of a dense population of Gaussian tuning functions 
against tuning functions width, for different values of effective decoding time (shortest - 
uppermost triplet, longest - lowermost triplet). In each triplet 3 values of prior standard 
deviation were examined - ax = 1 (solid), ax = 2 (dashed) and cr^ = 3 (dotted), where the 
optimal width is indicated by squares, circles and diamonds, respectively. (B) The normal- 
ized MMSE for different values of effective decoding time [ax = 1), obtained theoretically 
(dashed lines, optimal width indicated by circles) and in simulation (solid lines, optimal 
width indicated by squares). 



2.2.1 The univariate case 

Based on f[T^ . we plot the normalized MMSE as a function of the width (figure [T]( A)) for 
different combinations of effective decoding time and prior standard deviation. The effective 
time, teE, defined in Section 14.31 is proportional to the time over which spikes accumulate. 
The existence of an optimal width, which is not only positive b ut also varies with cr^ an d tes, 



is clearly exhibited. A similar result was first demonstrated by lBobrowski et al.l (120091 ). but 
the dependence of optimal width on the two other parameters was not examined. Note that 
the derivation in (I12p is not reliable in the vicinity of the y-axis, because the approximation 
of uniform population firing rate (see Section 04. 3p ) is not valid when a — )■ 0. Nevertheless, 
it can be proved that when a — )■ 0, P(Nt = 0) — ?■ 1 and as a consequence X°P*(Nt) — )■ fix 
and indeed MMSE/cr^ 1. 



The existence of an optimal width is very intuitive, considering the tradeoff between mul- 
tiplicity of spikes and informativeness of observations. When tuning functions are too wide, 
multiple spikes will be generated that loosely depend on the stimulus, and thus the obser- 
vations carry little information. When tuning functions are too narrow, an observed spike 
will be informative about the stimulus but the generation of a spike is very unlikely. Thus, 
an intermediate optimal width emerges. To verify that the results were not affected by the 
approximation in (ITT]) , we compare in figure [HB) the theoretical MMSE with the MMSE 
obtained in simulations. As can be seen in the figure, the two functions coincide, and in 
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Figure 2: Optimal tuning functions width as a function of (A) effective decoding time, and 
(B) prior standard deviation. All curves are well fitted by the same function: a(tes, dx) = 



particular, the optimal widths are nearly identical . 

We now turn to analyze the effect of the two parameters, and teS, in (fT2|) . The ef- 
fective time is a measure for the number of spikes generated by the neural population, since 
it is proportional to Amax^ (where Amax is the maximal mean firing rate of a single neuron) 
and also to 1/Ac (which indicates the population size when the population is required to 
"cover" a certain fraction of the stimulus space). Longer effective times increase the likeli- 
hood of spikes under all circumstances, and therefore the drawback of narrow tuning functions 
is somewhat mitigated, leading to a reduction in optimal width (as illustrated in figure [2]^ A)). 

The prior standard deviation reflects the initial uncertainty in the environment: the larger 
it is - the less is known a-priori about the identity of the stimulus. Confidence about stimulus 
identity (associated with small MMSE) may stem from either a deterministic environment or 
from numerous observations for which the likelihood is very sharp. When the environment 
is less certain, many more observations are needed in order to obtain a narrow posterior 
distribution leading to smaller MMSE, and thus the optimal width increases (as illustrated 
in figure [2]^ B)). Interestingly, the optimal width is very well fitted by a simple expression of 
effective time and initial uncertainty: 



We comment in passing that the dimension of tes is inverse length. From ([T]) we see that 
lim^^o Q^°^* = CTx, a result which can be obtained analytically as follows. When t — )■ the 
Poisson random variable Y = N]^ converges in probability to a Bernoulli random variable 
with "success" probability ates, in which case the expression for the normalized MMSE ( fT2i) 



1 




(1) 
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greatly simplifies, 



MMSE 1 - ateff ates + (1 - ates) a. 



al 1 l + al/a^ 

Differentiating with respect to a and setting the result to yields 



When the width equals the prior standard deviation, the normalized MMSE can be calculated 
exactly for any effective time. 



MMSE 



. . k k\ 

k=0 

^ (e-^*eff _ 1) = ' ^ 



. _ ik + l] 



which for very short time windows converges to the optimal normalized MMSE. 



2.2.2 The multi-dimensional case 

We start by considering a radial prior distribution and radial tuning functions, namelyS^. = 
a^l and Am = a^I, where I is the dx d identity matrix. In this simple scenario the posterior 
covariance matrix in (fT^ becomes 



-1 



where Y = Y^m^r ~ Pois(a'^tfg) = Pois(atefr(A/27ra/Ac)'^-^) . From <^ we see that 



MMSE 



at 



1 



1 + Yal/a' 



As long as ylna""^^ / > 1 the parameter of the Poisson random variable Y increases with 
the dimensionality D, in a manner that is equivalent to longer effective time (albeit width- 
dependent). Consequently, recalling ([1]), the optimal width decreases with the dimensionality 
of the stimulus. Numerical calculations show that indeed this is the case in 2D and 3D for 
eff ective times which are not ex tremely long. This is in contrast with the predictions made 
by IZhang and Sejnowskil ( 1l999l ) , where it was speculated that performance is indifferent to 
the width of the tuning functions in 2D, and improves with infinitely increasing width in 
higher dimensions. 

When the prior covariance matrix and tuning functions shape matrix A are diagonal with, 
possibly different, elements {o''^d}d=i {'^d}£=i' respectively, the MMSE is given by 

D 



MMSE = 



-2 
'-'x,d 



a-:'Y 
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where Y = "^^NJ^ ~ Poisl^rfflldLi Oid)- Since the random variable Y appears in every term 
in the sum, combined with different prior variances, the optimal width vector will adjust 
itself to the vector of prior variances, namely the optimal width will be largest (smallest) in 
the dimension where the prior variance is largest (smallest). 

An important feature of the theoretical results is the ability to predict the qualitative be- 
havior that biological tuning functions should adopt, if they are to perform optimally. From 
([1]) we see that if the environmental uncertainty (expressed by a^) decreases, the tuning 
functions width is expected to reduce accordingly (this holds in 2D as well). This predic- 
tion provides a possible theoretical explanat ion for some results obtained in psychophysical 
experiments (lYeshurun and Carrasco (jl999[ )). In these experiments human subjects were 



tested on spatial resolution tasks where targets appeared at random locations on a computer 
screen. In each trial the subjects were instructed to fixate on the screen center and the target 
was presented for a brief moment, with or without a preceding brief spatial cue marking the 
location of the target but being neutral with respect to the spatial resolution task. The au- 
thors found that the existence of the preceding cue improved both reaction times and success 
rates, concluding that spatial resolution was enhanced following the cue by reducing the size 
of neuronal receptive fields. We argue that the spatial cue reduces the uncertainty about the 
stimulus by bounding the region where the stimulus is likely to appear (i.e. ax is reduced). 
In light of our results, an optimal sensory system would then respond to the decrease in prior 
standard deviation by narrowing the tuning functions towards stimulus onset. 



2.2.3 The effect of environmental noise 

In figure [2|^B) we saw that the optimal width increases with cr^, the prior environmental 
uncertainty. Since external noise adds further uncertainty to the environment, we expect 
the optimal tuning functions to broaden with increasing noise levels, but the exact effect of 
noise is difficult to predict because even in the case of additive Gaussian noise with standard 
deviation the expression for the MMSE (I17p becomes slightly more complex. The optimal 
width in this case is plotted against effective time (figure [21(A)) and against prior standard 
deviation (figure [3]J^B)) for different values of noise levels. All curves are very well-fitted by 
the function 

where the average squared error of each fit is less than 1.3 x 10"'^. This result implies that 
the effect of noise boils down to increasing the prior uncertainty. Since the noise is additive, 
and is independent of the stimulus, the term ^ cr^. + cr^ is precisely the standard deviation 
of the noisy stimulus X = X + W, meaning that the optimal width for estimating X is the 
same as for estimating X. This might not seem trivial by looking at f[T7l) . where the two 
standard deviations are not interchangeable, but is nonetheless very intuitive: seeing that 
all spikes depend only on X, it is the only quantity that can be estimated directly from the 
spike trains, whereas X is then estimated solely from the estimator of X . Therefore, optimal 
estimation of X necessitates first estimating X optimally. 
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Figure 3: Optimal tuning functions width in the presence of noise, for different values of 
parameters: (A) cr^ = 1 (solid), = 2 (dashed) and = 3 (dotted); (B) tes = 0.1 (solid), 
tes = 1 (dashed) and tes = 10 (dotted). 

An indication for the symmetric role played by the prior and the noise, can be seen in 
figure [3l^A), where the dashed red curve {ax = 2, cr„, = 1) merges with the solid green curve 
{a^ = 1, a^j = 2). Note also that the greatest effect of noise is observed when the prior 
standard deviation is small, whereas for large the relative contribution of noise becomes 
more and more negligible (figure |3|^B)). 

2.2.4 The effect of multimodality 

Integration of sensory information from different modalities (e.g. visual, auditory, somatosen- 
sory) has the advantage of increasing estimation reliability and accuracy since each channel 
provides an independent sampling of the environment (we focus here on the simple case where 
tuning functions are bell-shaped in all modalities). In the absence of noise, this improvement 
is merely reflected by an increased number of observations, but the main advantage manifests 
itself in the noisy case, where multimodal integration has the potential of noise reduction 
since the noise variables in the two modalities are independent. Considering two sensory 
modalities, denoted by v and a, and indexing the respective parameters of each modality 
by V and a, we provide a closed form expression for the MMSE in f[TH|) . When integrating 
the observations, the spike trains in each modality are weighted according to their reliabil- 
ity, reflected in the predictability of the stimulus based on the estimated input (related to 
o'w,v, o'w,a) and in the discriminability of the tuning functions (related to a^, aa)- For instance, 
when "visual" noise has infinite variance or when "visual" tuning functions are fiat, the spike 
trains in the "visual" pathway do not bear any information about the stimulus and are thus 
ignored by the optimal decoder. Indeed, when substituting o"^ „ = oo or = oo in (llSp . it 
reduces to (fT7|) with a^.a in place of a, cr^. Note that in the absence of noise the MMSE 
reduces to E[(cr~^ + a~'^{Yy + y^)) ] , which is identical to the unimodal expression for the 
MMSE with twice the expected number of observations {Y^ + ~ Pois(a2teff)). 
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Figure 4: (A) Optimal tuning functions widths in different sensory modalities as a function 
of the ratio between the noise standard deviations of the underlying physical phenomena 
{tes = 1, cTx = 1 and auj,v = 0.5). (B) Optimal tuning functions width in symmetrical 
bimodal and in unimodal settings {tea = 1 and ax = 1). The numbers in parentheses denote 
sizes of populations. 



In a bimodal setting, the optimal width in each modality adapts itself to the standard 
deviation of the noise associated with that modality. To see that, we fix aw,v = 0.5, and plot 
the optimal widths against the ratio of noise standard deviations (Tw,a/o'w,v (figure H)^ A)). 
When the "visual" channel is noisier {(Jw^a/<^w,v < 1)) the optimal "auditory" tuning functions 
are narrower than their counterparts, and when the "auditory" channel is noisier the situation 
is reversed. Interestingly, even when the "visual" noise variance is fixed, the optimal width 
in the "visual" modality is slightly affected by the "auditory" noise variance. 

In the symmetric case we can easily compare the unimodal and bimodal settings. As- 
suming that the number of tuning functions in each modality is the same, integrating two 
modalities doubles the expected number of observations, naturally resulting in a smaller opti- 
mal width (figur^U^B)). When the overall number of tuning functions is maintained constant, 
splitting them into two populations of equal size in each modality is preferred in terms of 
MMSE (results not shown), because the bimodal setting also has the advantage of partial 
noise-cancellation. But what can be said about the optimizing width in each setting? In 
figure m^B) we observe that 2M optimal tuning functions in a single modality are narrower 
than M + M optimal tuning functions for two different modalities. What is more intrigu- 
ing is the fact that single-modality optimal tuning functions are still narrower than their 
double-modality counterparts even when they are outnumbered, as long as the noise stan- 
dard deviation is sufficiently large. 

The latter prediction may be related to experimental results dealing with multisensory 
compensation. Neuronal tuning functions are observed not only in the sensory areas but also 
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in associative areas in the cortex, where cortical maps are adaptable and not strictly de- 
fined for a single modality. iKorte and Rauscheckerl ( 1l993l ) found that tuning functions in the 
auditory cortex of cats with early blindness are narrower than those in normal cats. We hy- 
pothesize that this phenomenon may subserve more accurate decoding of neural spike trains. 
Although optimal auditory tuning functions are wider in the absence of visual information 
(figure m^B)), our results predict that if the blindness is compensated by an increase in the 
number of auditory neurons (for instance, if some visual neurons are rewired to respond to 
auditory stimuli), the auditory tuning functions will in fact be narrower than in the case of 
bimodal sensory integration. Indeed, neurobiological experiments verify that the specificity 
of cortical neurons can be modified und er conditions o f sens ory deprivation: auditory neurons 
of deaf subjects react to visual stiinuli (IFinnev et al.l (I20011)') and vi s ual ne urons of blind sub- 
jects react to sounds (I Weeks et al.l (|2000[ ): iRauschecker and Harris! (Il983l )). We predict that 
even when the compensation is partial, namely when only a portion of the visual population 
transforms to sound- sensitive neurons, the optimal tuning functions in a noisy environment 
are still narrower than the "original" tuning functions in two functioning modalities (figure 



2.3 Dynamic optimal width 

Two important features of the decoding process pertain to the time available for decoding 
and to the prior information at hand. In principle, we may expect that different attributes 
are required of the optimal tuning functions under different conditions. We have seen in 
figure [2] (see also ([T])) that the optimal width increases with the initial uncertainty and de- 
creases with effective time. In this case the width is set in advance and the quantity being 
optimized is the decoding performance at the end of the time window. We conclude that 
tuning functions should be narrower when more knowledge about the stimulus is expected 
to be available at time t. However, in realistic situations the decoding time may depend on 
external circumstances which may not be known in advance. It is more natural to assume 
that since observations accumulate over time, the tuning functions should adapt in real time 
based on past information, in such a way that performance is optimal at any time instance, as 
opposed to some pre-specified time. To address this possibility, we now allow the width to be 
a function of time and seek an optimal width function. Moreover, it seems more functionally 
reasonable to set the MMSE process as an optimality criterion rather than the MMSE at 
an arbitrary time point. As a consequence, the optimal width function may depend on the 
random accumulation of observations, namely it becomes an optimal width process. 

We begin by analyzing the simple case of a piecewise constant process of the form 

a{t: iAt<t<{i + l) At) = a^, (i = 0, 1, . . . ) 

and search for optimal (possibly random) variables {ai}'^i. At each moment, we assume 
that a{t) is large enough with respect to Ac so that the mean firing rate of the population, 
^^Am(x,t), is independent of the stimulus. Therefore, the posterior distribution is now 

pWN.,^cexp(-i(ii^ + i: £ f^]]' 
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where ti(^rn) is the time of the z(m)-th spike generated by the m-th sensory celL Summing 
over all spike-times in each At-long interval, it is simple to show that at the end of the i^-th 
interval the posterior density function is Gaussian with variance 



where Yi = Em(^{T+i)At ' ^Ht) ~ Pois(«,Ateff) for all i and Atefr = (v^/Ac) An^axAt. 




The value of is obtained by minimizing E[((T^^ + a~'^YQ)~^]. When "choosing" the 
optimal width for the second interval, the population can rely on the spikes observed so far 
to modify its width so as to minimize E[(cT]^^ + a~'^Yi)^^]^ where ajf^ = cr~^ + Oq'^Yq. By 
recursion, at time t = iAt the optimal width parameter is determined by taking into 
account all spikes in the interval [0, (z — l)At] and minimizing E[{a-^ + a-^Yi)-% where 



reflects the effective uncertainty at time iAt after integrating prior knowledge and informa- 
tion from observations. We see that the optimal width process is monotonically nonincreasing 
(since the sequence {erf} is nonincreasing), and the rate of reduction in tuning functions width 
is affected by the rate of spike arrivals. Examples for both slow and fast spiking processes 
can be seen in figure [5]^A), where the average optimal process (averaged over 1000 trials) is 
plotted as well. In general, an optimal width process is unlikely to drastically deviate from 
the average due to an internal "control" mechanism: if few spikes were obtained then the 
width is still relatively large, increasing the chances of future spikes, whereas multiplicity of 
spikes results in small width, limiting the probability of observing more spikes. In the limit 
At — 0, the optimal width process starts exactly at cr^ and jumps to a value \/2-times smaller 
than its previous value at the occurrence of each spike (see Section S4 in the Supplementary 
Material for proof). 

When we examine the average optimal width process for different values of prior standard 
deviation (figure [5t^B)), we see that prior to the encoding onset it is always best to keep the 
tuning functions practically as wide as the probability density function of the stimulus (i.e., 
at t = 0, before any spikes are obtained, set a°P*(t = 0) = for all realizations). The 
average rate of dynamic narrowing is then related to the initial uncertainty, where the fastest 
narrowing occurs in the most uncertain environment. Note that under the conditions stated in 
Section |473| the mean population activity ^mix) is proportional to the width parameter 
a. Thus, figure [5]^B) can be interpreted as refiecting a dynamic decrease in population 
activity as a function of time. Furthermore, there is an excellent fit between the average 
optimal process to the simple function 
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A 



B 




Figure 5: (A) Sample realizations of optimal width processes for fast (bottom) and slow 
(top) spiking processes. The average over 1000 trials is plotted in the middle {ax = 1 and 
Ateff = 0.1). (B) Average optimal width processes for different values of prior standard 
deviation, all coinciding with the function a{t,ax) = (|4ff(i) + 




Figure 6: The MMSE associated with four different populations, each one tuned to a dif- 
ferent optimality criterion. The MMSE was assessed using Monte Carlo methods. Inset: 
Enlargement for late times. 
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Implementation of an optimal width process seems biologically implausible, due to its 
discontinuity and its perfect synchronization with the spikes. It is more likely for biological 
tuning functions to adopt the average optimal width process strategy, namely to automati- 
cally commence a dynamic narrowing at the beginning of an encoding period, independently 
of the generated spikes. But is the average optimal width process still more efficient than 
constant width? To address this question, we simulated the dynamics of four populations: 
two with constant widths tuned for optimal performance at the middle and at the end of the 
decoding time window, respectively, one with a piecewise-optimal width process and another 
one with a width function corresponding to the average optimal width process. As expected, 
each of the constant-width populations outperforms the other at the time for which it is 
tuned (figure [6]). The dynamic-width populations significantly outperform them both for 
short times, where for later times the differences become negligible. Therefore, implementa- 
tion of dynamic width offers a clear advantage over operation with constant width, especially 
since there is seldom a reference time of interest to which the tuning functions could be 
tuned in advance. Interestingly, the predetermined width function is as good as the optimal 
width process for short times and even outperforms it for longer times, meaning that the 
performance is not being compromised when a predetermined width function is employed in 
place of the optimal width process. 



Our prediction that tuning functions should dynamically narrow during t he course of 



encod ing a stimulus, coincides with some reported experimental phenomena. I Wang et al. 
(120051 ) recorded the activity of auditory cortical neurons in awake marmoset monkeys in 
response to a sound stimulus, which is the preferred stimulus of only one of the neurons. 
When comparing the activity profiles of the two neurons, they found that both fire intensively 
immediately following stimulus presentation, but only the neuron which is tuned to the 
stimulus maintained high activity throughout stimulus duration. Such behavior is equivalent 
to (and can stem from) dynamic narrowing of receptive fields. Widespread onset responses 
(caused by initially large receptive fields) subserve fast detection of the occurrence of sounds, 
whereas sustained firing in specific neurons facilitates extraction of the exact features of 
the sound. In the context of one-dir nensional tuning f unctio ns, an illustrative example for 
dynamic narrowing was observed by iGutfreund et al.l (l2002i ) in the external nucleus of the 
inferior colliculus in anesthetized barn owls. The authors recorded from a population of 
neurons tuned to a short range of inter-aural time differences (ITD) in response to stimuli 
with different ITD values. While initial responses were intensive for all stimuli, intensive 
ongoing responses were restricted to stimuli within the ITD range preferred by the population, 
implying a dynamic reduction in tuning function width. A similar phenomenon was also 
obtained for single neurons. 



3 Discussion 

In this paper we have studied optimal encoding of environmental signals by a population of 
spiking neurons, characterized by tuning functions which quantify the firing probability as 
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a function of the input. Within the framework of optimal Bayesian decoding, based on the 
mean square reconstruction error criterion, we investigated the properties of optimal encod- 
ing. Based on the well known inequality, bounding the MMSE by the Bayesian Cramer-Rao 
lower bound ([7]), we tested the hypothesis that the tuning function width minimizing the 
MMSE can be recovered by minimizing the (asymptotically tight) lower bound. This hidden 
assumption was implied in pr evious studies dealing with cl a ssical, non-Bayesian, estima - 
tors based on neural deco ding ( ISeung and SompolinskyI ( 119931 ) ; IZhang and Sejnowskil (119991 ) ; 
Brown and Backer Unfortunately, as argued in Section [2l the predictions were often 

incompatible with the assumptions required in the derivation of the bound, and thus could 
not be utilized to compare bound-based predictions with results based on direct minimiza- 
tion of the MMSE. It should be noted that in the non-Bayesian setting an optimal estimator 
does not necessarily exist, because the performance of each estimator depends on the stim- 
ulus, which is an unknown parameter. For instance, X(Nj) = a is the optimal estimator if 
and only if X = a. The above-mentioned studies overcame this problem by assuming that 
the tuning functions are dense enough so that the population's Fisher information can be 
approximated by an integral and thus becomes independent of the stimulus. They all con- 
cluded that in one dimension estimation performance improves with narrower tuning curves. 
However, the underlying assumption is valid only when the width is large enough with re- 
spect to Ac, and obviously breaks down as a — )■ 0. Therefore, values in the proximity of 
a = are not part of the solution space, and limQ,_^o <J{X) cannot be estimated using the 
suggested approximation. Moreover, realistic environments are dynamic and it seems more 
reasonable to model their features as random variables rather than unknown parameters. 
This means that optimal tuning curves may be more fruitfully examined within a Bayesian 
framework. Indeed, neurobiological evidence indicates that the nervous system can adapt to 
changes in the statistical natu r e of t h e environment a nd its random features at multiple time 
scales f e.glPettet and GilbertI (Il992[ ): iBrenner et all Jioooh : lOraeoi et all (l2000h : IPean et al 
(l2005l ): lHosova et all (120051 )7 



Starting from the Bayesian Cramer-Rao lower bound ([7]) we have studied predictions 
about tuning function properties based on this, asymptotically tight, bound. As we demon- 
strated, this bound-based approach has little value in predicting the true optimal tuning 
functions for finite decoding time. Even though the performance inequalities hold in all 
scenarios, optimizing the bound does not guarantee the same behavior for the bounded 
quantity. Moreover, an important, and often overlooked observation is the following. Per- 
formance bounds might be too model-dependent, and when the model is misspecified (as is 
often the case) they lose their operative meaning. For instance, when the model is incorrect, 
the MMSE does not converge asymptotically t o the BCRB, even though the estimator itself 
might converge to the true value of the state ( White ( 19821 )). Thus, given that models are 
often inaccurate, the danger of using bounds in apparent even in the asymptotic limit. 



By deriving analytical expressions for the minimal attainable MSE under various scenar- 
ios we have obtained optimal widths directly by minimizing the MMSE, without relying on 
bounds. Our analysis uncovers the dependence of optimal width on decoding time window, 
prior standard deviation and environmental noise. The relation between optimal width and 
prior standard deviation may very well explain physiological responses induced by attention. 
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as noted in Section 12.2^ where further experimental evidence supporting the predictions of 
this paper was discussed. In particular, the work of iKorte and Rauscheckerl ( 1l993l ) related 
to tuning functions in the auditory cortex of cats with early blindness was discussed, and 
specific predictions were made. We also showed that when the constant-width restriction 
is removed, tuning functions should narrow during the course of encoding a stimulus, a 
phenomenon which has already been demonstrated for Al neurons in marmoset monkeys 
(IWang et al. and in the external nucleus of the inferior colliculus in barn owls (e.g., 



Gutfreund et"aD fl2002[ )) 



In summary, in this paper we examined optimal neuronal tuning functions within a 
Bayesian setting based on the minimum mean square error criterion. Careful calculations 
were followed by theoretical results, based on a natural criterion of optimality which is com- 
monly employed, but which has seldom been analyzed in the context of neural encoding. Our 
analysis yielded novel predictions about the context-dependence of optimal widths, stating 
that optimal tuning curves should adapt to the statistical properties of the environment - in 
accordance with ecological theories of sensory processing. Interestingly, the results predict 
at least two time scales of change. For example, when the statistical properties of the envi- 
ronment change (e.g., a change in the noise level or prior distribution) the optimal encoding 
should adapt on the environmental time scale in such a way that the tuning function widths 
increases with noise level. However, even in the context of a fixed stimulus, we predict that 
tuning functions should chang e on the fast time scal e of sti mulus presentation. Interestingly, 
recent results, summarized by iGoUisch and Meisterl (120101 ). consider contrast adaptation in 
the retina taking place on a fast and slow time scale. The fast time scale, related to "contrast 
gain control" occurs on time scales of tens of millisecond, while the slow time-scale, lasting 
many seconds, has been referred to as "contrast adaptation". While our results do not di- 
rectly relate to contrast adaptation, the prediction that optimal adaptation takes place on 
at least two, widely separated, time scal es is promising. Moreover, the circuit mechanisms 
proposed in (IGollisch and Meisterl (120101 )) may well subserve some of the computations re- 
lated to tuning function narrowing. These results, and the above-mentioned experimentally 
observed phenomena, suggest that biological sensory systems indeed adapt to changes in 
environmental statistics on multiple time scales. The theory provided in this paper offers a 
clear functional explanation of these phenomena, based on the notion of optimal dynamic 
encoding. 



4 Methods 

As stated in Section [2] in this paper we employ the theoretical framework of optimal signal 
estimation in the context of neural encoding and decoding, in an attempt to find the optimal 
tuning functions that facilitate optimal estimation of X from the observations (the neural 
spike trains). We refer the reader to the beginning of Section [2] for the relevant problem setup 
and definitions. 
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4.1 Optimal decoding of neural spike trains 



As is well known (Ivan Treed . 119681 ). the optimal estimator minimizing the MSE is given by 
X°P'^(Nt) = E[X|Nt]. This estimator can be directly computed using the posterior distribu- 
tion, obtained from Bayes' theorem 



p{x\Nt) 



p{x)P(Nt\x) 



M 



cp {x) e 



n (^) ^) 



NV 



(2) 



m=l 



where c is a normalization constant. We comment that here, and in the sequel, we use the 
symbol c to denote a generic normalization constant. Since such constant will play no role 
in the analysis, we do not bother to distinguish between them. For analytical tractability we 
restrict ourselves to the family of Gaussian prior distributions X ~ p{-) = M{iixi(^'i)i and 
consider Gaussian tuning functions 



^m{x) — An 



(m = 1, . . . ,M) 



(3) 



which i n many cases set a fair approximation to biological tuning functions (e.g. iPouget et al 
(!2nnn[ ): lAnderson et"Zl (|200o|)). From (E]) we have 



{x-p.x) 

p (a;|N() = ce ^.ri 
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In the mult i- dimensional case X ~ jV{fi^, S^) and the Gaussian tuning functions are of the 
form 

Xm{x) = A^axe-5(''-'=-)"^- (m = 1, . . . , M). (5) 



4.2 Bayesian Cramer-Rao bound 

The Fisher information is defined by 



J{X)=E 



^lnP(Nt I X) 

dX ^ * I ^ 



(6) 



If X is deterministic, as in the classic (non-Bayesian) case, and B{X) = E[X-X] is the esti- 
mation bias, then the error variance of any non-Bayesian estimator X(Nt) is lower bounded 
by (l + ^B{X)Yj-\X), which is the Cramer-Rao bound fIvan TreesI fll968h l The exten- 
sion to the Bayesian setting, where X is a random variable with probability density function 
p{-), often referred to as the Bayesian Cramer- Rao bound (BCRB), states that 



E 



X -X 



> 



E[j{x)]+i{p{x)y 



(7) 



where l(p ( x)) = E[(^ Inp(X))^] and the bounded quantity is the MSE of any estimator 
fIvan TreesI fll968[ )l Note that the expectation in ([7]) is taken with respect to both the obser- 
vations Nt and the state X, in contrast to the non-Bayesian case where the unknown state 
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X is deterministic. The BCRB is asymptotically tight, namely in the limit of infinite number 
of observed spikes (t — )■ oo) the MMSE estimator satisfies equation ([7]) with equality. 

The second term in the denominator of ([7]) is independent of the tuning functions, and 
in the case of a univariate Gaussian prior is given by 



X(p(x)) = E 



at 



2 • 



The expected value of the population's Fisher information (derived in Section SI in the 
Supplementary Material) is given by 



M 



1 "m, (a^ + crl) 



2^V2 



In the multi-dimensional case the right hand side of ([7]) is replaced by the inverse of the 
matrix J = 'E[J{'K)] + X(p(x)), where 



^(X) 
X(p(x)) 



E 
E 



(VlnP(Nt I X))(VlnP(Ni | X)) 



(VlnP(X))(VlnP(X))^ 



T 



and the left hand side is replaced by the error correlation matrix R = E[(X — X)(X — 
X)^] . The interpretation of the resulting inequality is twofold. First, the matrix R — J^^ is 
nonnegative definite and second. 



E 



Xi, — Xfc 



> (J-') 
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(9) 



where d is the dimension of X (Ivan Treed (119681 )). 
It is straightforward to show that in this case 

X {p (x)) = E fc^ (X - (X - 



The expression for E [J' (X)] is similar to ([8]) and is derived in Section SI in the Supplementary 
Material, yielding 



M 



m=l 
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X 



where = (c^ - /x^)'^(A^ + S^.)-i(c^ - /x^). 



(10) 
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As noted in Section [TT| iBethge et al.l (|2002[ ) were the first to address the shortcomings of 
lower bounds on the MSE in a Bayesian setting. These authors stressed the fact that perfor- 
mance bounds are usually tight only in the limit of infinitely long time windows, and cannot 
be expected to reflect the behavior of the bounded quantities for finite decoding times. They 
observed that in the classical approach, for any X, the conditional MMSE for unbiased esti- 
mators asymptotically equals 1/J'{X). By taking expectations on both sides they obtained 
E[l/j7'(-'f)] as the asymptotic Bayesian MMSE. Unfortunately, as noted following 0?]), the 
asymptotically tight lower bound in the Bayesian setting is [E [JT" (X)] +X(p(x))]~^ which 
can easily be seen, using Jensen's inequality, to be smaller than E[l/j7'(X)]. The reason for 
this subtle discrepancy is that the optimal Bayesian estimator can make use of the prior and 
may be conditionally biased, while any non-Bayesian estimator does not have access to the 
prior distribution. Thus, the Bayesian error may be smaller than E[l/j7'(X)] 



4.3 Analytical derivation of the MMSE 

In this Section we proceed to establish closed form expressions for the MMSE. In order to 
maintain analytical tractability we analyze the simple case of equally spaced tuning functions 
{cm+i — c„i = Ac) with uniform width = «)• When the width is of the order of Ac or 
higher, and M is large enough with respect to ax-, the mean firing rate of the entire population 
is practically uniform for any "reasonable" X. As shown in Section S2 in the Supplementary 
Material, the sum \m.{x) is well approximated in this case by A (a) = -\/27r-^Amax- Under 
these conditions the posterior (jl]) takes the form 



p {x\Nt) = cexp < — 




fill 



implying that X|Nt ~ A/'(/i, (X^), where 



a 



1 , T^mN't 



Recalling that the spike trains are independent Poisson processes, namely ^^^^1^ ~ 
Pois( Am(X)t) , and in light of the above approximation, Y = NJ^ is independent of 

X and F ~ Pois(atefT), where tes = ^K^^t. Since X°P'{Nt) = E[X|NJ = fi, the MMSE is 
given by 



MMSE = E 



X 



^optV 



E[E [{X-fif\Nt]] =E[a^] 



and thus we get an explicit expression for the normalized MMSE, 
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In the multivariate case we assume that the centers of the tuning functions form a dense 
muhi- dimensional lattice with equal spacing Ac along all axes, and ( ITT]) becomes 



p (x|N,) = cexp |^(x - ^i,f (x - + Nr (x - c^f (x - c. 



namely X|N( ~ A/'(/i, S), where 



(13) 
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In this case Y = Em^r ~ Pois( VM^fff). ^efr - (v^/Acj'^Amaxt (see Section S2 in the 
Supplementary Material), and 
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(15) 



4.4 Incorporation of noise 

Suppose that due to environmental noise the value of X is not directly available to the sensory 
system, which must respond to a noisy version of the environmental state - X. For simplicity 
we assume additive Gaussian noise: X = X + W, W ^ A/'(0, al). For a dense layer of sensory 
neurons the joint posterior distribution of the environmental state and noise is 

p (a;, w) P (Nt|a;, w) p (x) p (w) P (N^ |x + w) 



p (x, w\Nt 
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(x + in-cm) »rm 
njT 2^ ^' 

CC ^ 6 ™ 6 



2^ 



which is Gaussian with respect to W, where o"^ = (cr,^^^ + a ^Xlm^t") ^- Integration over 
w gives the marginal distribution 



p {x\lS5t) = ce 
which is Gaussian with variance 



TV," 
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1 F 
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where F = A'^™ ~ Pois(atefr) as before. The MMSE is computed directly from the 
posterior variance: 



MMSE = E [<j2] = E 



Y 
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al + alY 



(17) 
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Note that when cr^„ — > oo the sensory information becomes irrelevant for estimating X, and 



indeed in such case MMSE — t- o"^. 



4.5 Multisensory integration 

Consider the case where the environmental state is estimated based on observations from 
two different sensory modalities (e.g., visual and auditory). The spike trains in each channel 
are generated in a similar manner to the unimodal case, but are then integrated to yield 
enhanced neural decoding. All quantities that are related to the first and second modalities 
will be indexed by v and a, respectively, although these modalities are not necessarily the 
visual and the auditory. 

Each sensory modality detects a different noisy version of the environmental state: = 
X + Wy in the "visual" pathway and Xa = X + Wa in the "auditory" pathway. The noise 
variables are Gaussian with zero-mean and variances a^,^ and 0"^^, respectively, and inde- 
pendent of each other as they emerge from different physical processes. The noisy versions of 
the environmental state are encoded by + Ma sensory neurons with conditional Poisson 
statistics: 

Fois ( Xl,{X,)t) , (m = l,...,M,) 
(m = l M„ 



iV™'"|X,~Pois A;^,(xjt 



where the tuning functions in each modality are Gaussian with uniform width. 



Xl. (x) = Xt^^^e 
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We assume that the tuning functions are dense enough {a^ ^ Ac^,, ^ Aca) and equally 
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The full expression for the MMSE in this case is 
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Pois{a,t:^) and ^ J^^N^ 



Pois(aatefr) (the derivation 



appears in Section S3 in the Supplementary Material). 
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Supplementary Material 



SI. Expectation of population Fisher information 

In this section we calculate the expected value of the Fisher information for a population of 
sensory neurons following Poisson spiking statistics, having Gaussian tuning functions of the 
form 



Xm (x) = AmaxC , (m = 1, . . . , M) 

and encoding a stimulus with Gaussian prior distribution {X ~ p{-) = N'{^xi <^1))- We start 
by calculating the population's local Fisher information, taking into account the conditional 
independence of the Poisson processes: 
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Due to the conditional independence the expectation is separable for every m 7^ n, and equals 
because E[iV™|X] = A^(X)t. Thus, 

var(Aft'"|X)=A,„(X)i 
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Taking the expectation over the prior p{x) yields: 
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where 

By defining 1^^ ~ M{fim, cr^), = 1, 2, ... , M), the integral becomes E[F^ — 2cm,^m 

Cm] = + /"m - 2cm/irri + c^, and by Substituting /i^, we get 

. r (cm-Mz)^ 

E [J (X)] = J2 ^ [^^ i^l + oL) + oL {Cm - ^^.f] . 
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In the mult i- dimensional case X ~ J^{fi^, S^.), the tuning functions are of the form 
\m (x) = A^axe-5(^-^-)"^™'(''-^'"), (m = 1, . . . , M) , 



and the population's Fisher information is given by 

M 



E 



M 



^ (iVr - (X) t) A^l (X - c^) ^ {N- - A„ (X) t) Al' (X - c„) 

\m=\ J \n=l 

M M 

mr - \m (x) t) (ivr - k (x) t) \x] a-^ (x - c„) (x - c^f a-^ 

m=l n=l 
M 

A.axt J2 {(X - c„) (X - c^f e-^(^--)"^- (^"-)} a;;,^ 



m=l 



By virtue of the linearity, we calculate the expected value of J7'(X) with respect to X by 
taking expectation over the expression in curly brackets: 



E{.} 



(X - c^) (X - Cm) e 



e 2 



5?" 



fx - c™,) fx - c. 



(iX 



where 



c / , , \ H'm.l-''m ■^^mH'm ^ ^m^mj 



Using the identity {A'^ + fi"^)"^ = A{A + flSearld fll982[ )) and the symmetry of all 

matrices it is a simple mathematical exercise to show that 



M 



E [J (X)] = A^axt XI ^"'^'"^Sfe + S,)-' (Am + S,) + Am (Cm, - /Xj (Cm - 



m=l ^|Am + S^l 

and ^m = (Cm - Atx-)^(^m + ^x)~^{C-,n " A^x)- 



S2. Uniform mean firing rate of population 

In this section we approximate the mean firing rate of the entire population of sensory cells 
for the case of equally spaced tuning functions with uniform width. First, consider the case 
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M = oo, for which the population's mean firing rate is given by 

oo 

A {X, a) = Amax ^ 



oo 

e~2 I a , 



The function A(a;, a) is continuous with respect to x, and therefore when the spacing goes to 
0, Ac ■ A(x, a) converges to the Riemann integral 

oo 

lim Ac ■ A(a;, a) = Amax / e~2(~) dc = AmaxV^STro! 

Ac-5.0 J 



-oo 



which is independent of x. This means that for any e > there exists 6 > such that if 
Ac/ a < 6 then the relative ripple (digression from uniformity) in the total firing rate 

max A(x, a) — min A(x, a) 
min A(x, a) 

is smaller than e. 

Now consider the case of finite M, 

A (x, a; M) = A,nax 2^ g-^l^^J . 

Since the limit at M — )■ oo exists, for every x and e > there exists Mi;{x) > such that for 
all M > Ms{x) the difference between the infinite sum to the finite sum is less than e, and 
therefore X{x, a; M) can be well approximated by 



m=— oo 



/ X Amax f -ifi^)^ , AniaxV27ra A , / n 

A,n (x) ^ -— / e 2l . J rfc = — = A (a) 



To quantify the desired ratio between the width and the spacing we plot the relative 
ripple (figure [T]^ A)). For a > 0.583Ac the relative ripple falls down below 0.5%, rendering 
the approximation A(x, a; M) = X{a) quite accurate. Note that the approximation fails near 
the most extreme tuning functions, but for a sufficiently large population this would not have 
any effect on the results, because the probability of X falling into these regions would be 
vanishingly small for any rapidly-decaying prior. 

The proof for the multi-dimensional case is similar, where the population's mean firing 
rate is approximated by 

^ oo / \ d 



m=—oo 

— oo 



The conditions which make the approximation valid vary with the dimensionality of the 
problem - as the dimension increases the width-spacing ratio must be larger in order to 
maintain the same approximation accuracy. For instance, in two dimensions where A = al 
(figure [TI^B)), the relative ripple falls down below 0.5% only for a > 1.446Ac . 
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A 



B 




a/Ac oi/Ac 



Figure 7: The relative ripple in the mean firing rate of a population of neurons with Gaussian 
tuning functions: (A) in single dimension, (B) in two dimensions. 

S3. MMSE for multisensory integration 

In this section we derive the MMSE for the case of decoding bimodal spike trains. Similarly 
to the unimodal noisy case, it is straightforward to show that the joint posterior distribution 
of the environmental state and noise variables is given by 

9 2 2 A/„ i", I „, \2 Ma (•^ I „. \2 



p{x,Wy,Wa\^1,'^t) = Ce g 2<T„,„g 2<T„,„g 2a„ g 2a, 



ce 2<t2 2<t2_„ t ^ 2\^<T„.a „=i 



which is Gaussian with respect to Wa-, seeing that the argument of the rightmost exponent is 



w,a 



m=l / \ 7X1=1 \m=l 



where ^ ~ {'^w\ + '^a ^ Xlm Integration over w;^ gives the joint posterior distribu- 

tion of the state and the "visual" noise: 

Mv f^,,., -iS 1 I (fmZ^il?. /vr™'" ;;-2 ( (fmlj^l Ar™-" 



5^1 2«2 Aj +<7„_, ^ ^^^t ) -2 ^+ 2. ^2 J^t 



which is Gaussian with respect to W^, seeing that the argument of the rightmost exponent is 
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where o"^^ = {<J^\ + a^'^Ylm-^t^'^) ^- Integration over gives the marginal distribution 
of the state: 



, .2 Ma l^a I Ma a ^ 



m — 1 a 



p(x|N^,N^) = ce 
which can be shown to be Gaussian with variance 



)KIi; ( V \'2 / 1} 



AT" 



at 



at 



a^ 



By defining = Y.-m^r'" ~ Pois(ai,t^fl:) and Ya = Y^m^r^" ~ Pois(aat^g), we get a 
relatively simple expression for the MMSE: 



MMSE 



E 



E 



1 + ^ + ^ 



a' Y' 



a' Y' 

w,a a 



+ 



„ al {al + al^^Y^) al [al + al^^Ya) 
Y Y 

^ V ^ -'a 



-1 



S4. Dynamics of the optimal width process 

In this section we prove that for the optimal piecewise-constant width process defined in 
Section 2.3, 



and 



lim 



a,; 



lim ao = <^x 

At-s>0 



0,1, 



M^oai+i 

where ai is the value of the width process after obtaining a total of i spikes. 

For short intervals, the Poisson random variable 1^* = Ym ^At converges in probability 
to a Bernoulli random variable with success probability aoAtes- The MMSE at t = At then 
takes the form 



MMSE 



2P {YAt = k) = ^ \ 



+ a~n 



a. 



-2 



a. 







a. 



-1 



1 - Ateff^ 

an ' 



The MMSE is minimized when a^^/ {a^^ + a~'^) is maximized, and thus Og^* = ax- Given the 
spikes history at t = At, if Ya* = then the MMSE expression for the next interval remains 
unchanged, whereas if Y^t = 1 then the MMSE at the end of the next interval is given by 



MMSE = E 



E 



_a^' + iY2At - I) a^ 
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where ai = cr^/ a/2 is the effective uncertainty after observing a single spike, and Y2At — 1 is 
a Bernoulh random variable with success probability aiAtes- Applying the same analysis as 
before, a°^^ = ai = chq^^ / \/2. By recursion, one can see that as long as k spikes were observed 
the optimal width for the next interval remains a'j^^, and if another spike is observed the 
optimal width for the next interval becomes Q!^^\ = / \f2. 
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